Sample of Data¶

Methodology¶

Data has been pulled from the original database. The samples employeeHistory_sample.csv and companies_sample.csv are informed by deals_sample.csv, where they only inclucde the co_id's within deal_sample.csv.

deals_sample.csv has the following constraint...

  • Deal Vintage Range: $(2010, 2020]$

...to minimize the size of each CSV to be under 100 MB. The data included in the CSV files represent approximately 30-50% of the data from the original database.

Analytics¶

Simple analysis is included in each section which are msot relevant to the Exploratory Analysis. Minimal transformations are exercised.

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
In [2]:
# Load in Data from CSV
companies = pd.read_csv('data/companies_sample.csv',encoding='utf-8').convert_dtypes()
deals = pd.read_csv('data/deals_sample.csv', encoding='utf-8').convert_dtypes()
emp_hist = pd.read_csv('data/employeeHistory_sample.csv', encoding='utf-8').convert_dtypes()

Companies - Sample¶

In [3]:
companies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137827 entries, 0 to 137826
Data columns (total 17 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   co_id                       137827 non-null  string 
 1   co_name                     137827 non-null  string 
 2   co_hq                       137251 non-null  string 
 3   co_city                     137249 non-null  string 
 4   co_state                    137251 non-null  string 
 5   co_year_founded             121593 non-null  Int64  
 6   co_revenue                  35463 non-null   Float64
 7   co_business_status          131423 non-null  string 
 8   co_financing_status         137827 non-null  string 
 9   co_ownership_status         137827 non-null  string 
 10  co_primary_industry         137554 non-null  string 
 11  co_primary_industry_group   137554 non-null  string 
 12  co_primary_industry_sector  137554 non-null  string 
 13  co_industries               137827 non-null  string 
 14  co_verticals                137827 non-null  string 
 15  co_keywords                 137827 non-null  string 
 16  co_description              137664 non-null  string 
dtypes: Float64(1), Int64(1), string(15)
memory usage: 18.1 MB
In [4]:
companies.head()
Out[4]:
co_id co_name co_hq co_city co_state co_year_founded co_revenue co_business_status co_financing_status co_ownership_status co_primary_industry co_primary_industry_group co_primary_industry_sector co_industries co_verticals co_keywords co_description
0 100002-34 Acton Marketing Lincoln, NE Lincoln NE 1982 <NA> Generating Revenue Corporate Backed or Acquired Acquired/Merged Media and Information Services (B2B) Commercial Services Business Products and Services (B2B) {"Media and Information Services (B2B)","Other... {} {"crosses selling","directing marketing","fina... Provider of direct marketing services. The com...
1 100003-15 Premama Providence, RI Providence RI 2011 5.5 Generating Revenue Venture Capital-Backed Privately Held (backing) Specialty Retail Retail Consumer Products and Services (B2C) {"Other Pharmaceuticals and Biotechnology",Pha... {HealthTech,"LOHAS & Wellness"} {"fertility care","maternity food","postnatal ... Online retailer of medicine supplements intend...
2 100004-68 Northeast Agencies Williamsville, NY Williamsville NY 1984 <NA> Generating Revenue Corporate Backed or Acquired Acquired/Merged Insurance Brokers Insurance Financial Services {"Insurance Brokers"} {} {"insurance agent","insurance policy","insuran... Wholesaler of insurance policies. The company ...
3 100004-86 Portamedic Bernards, NJ Bernards NJ <NA> <NA> Generating Revenue Corporate Backed or Acquired Acquired/Merged Other Healthcare Other Healthcare Healthcare {"Other Healthcare","Other Healthcare Services"} {} {"clinical research","health information servi... Provider of health information services. The c...
4 100005-67 BCV Social Chicago, IL Chicago IL 2009 10.9 Generating Revenue Formerly VC-backed Acquired/Merged (Operating Subsidiary) Media and Information Services (B2B) Commercial Services Business Products and Services (B2B) {"Media and Information Services (B2B)"} {"Marketing Tech",TMT} {"hospitality industry","manage crises","monit... Provider of social media in the hospitality in...
In [5]:
px.histogram(
    companies[['co_year_founded','co_primary_industry_sector']].dropna(),
    x='co_year_founded',
    color='co_primary_industry_sector',
    histnorm='percent'
).update_layout(
    xaxis={'title_text':'year'},
    yaxis={'title_text':'%'},
    title_text='% Companies per Years Founded',
    legend_title='primary industry sector'
)
In [6]:
px.box(
    companies[['co_revenue','co_primary_industry_sector']].dropna(),
    x='co_revenue',
    y='co_primary_industry_sector',
    log_x=True
).update_layout(
    xaxis={'title_text':'revenue (log)'},
    yaxis = {'title_text':'primary industry sector'},
    title_text='Revenue per Primary Industry Sector<br><sup>Revenue is measured in millions'
)

Deals - Sample¶

In [7]:
deals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235895 entries, 0 to 235894
Data columns (total 24 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   deal_id              235895 non-null  string 
 1   co_id                235895 non-null  string 
 2   co_name              235895 non-null  string 
 3   co_hq                235895 non-null  string 
 4   co_city              235893 non-null  string 
 5   co_state             235895 non-null  string 
 6   co_verticals         235895 non-null  string 
 7   deal_date            235895 non-null  string 
 8   deal_vintage         235895 non-null  Int64  
 9   deal_number          235895 non-null  Int64  
 10  deal_series          36216 non-null   string 
 11  deal_vc_round        55037 non-null   string 
 12  deal_type            235895 non-null  string 
 13  deal_premoney        52794 non-null   Float64
 14  deal_postmoney       85425 non-null   Float64
 15  deal_size            235895 non-null  Float64
 16  deal_pct_acq         82048 non-null   Float64
 17  deal_investor_count  86224 non-null   Int64  
 18  investor_id          235895 non-null  string 
 19  investor_name        235895 non-null  string 
 20  investor_gp_id       235895 non-null  string 
 21  investor_gp          235895 non-null  string 
 22  fund_id              235895 non-null  string 
 23  fund_name            235895 non-null  string 
dtypes: Float64(4), Int64(3), string(17)
memory usage: 44.8 MB
In [8]:
deals.head()
Out[8]:
deal_id co_id co_name co_hq co_city co_state co_verticals deal_date deal_vintage deal_number ... deal_postmoney deal_size deal_pct_acq deal_investor_count investor_id investor_name investor_gp_id investor_gp fund_id fund_name
0 175465-27T 434316-61 Renavotio Infratech (PINX: RIII) Tulsa, OK Tulsa OK {} 2020-12-31 2020 2 ... <NA> 8.33 <NA> <NA> {} {} {} {} {} {}
1 169689-07T 53951-32 ASG Technologies Naples, FL Naples FL {TMT} 2020-12-31 2020 11 ... <NA> 0.76 <NA> <NA> {} {} {} {} {} {}
2 160795-09T 181488-43 Origin (Industrial Supplies and Parts) San Francisco, CA SanFrancisco CA {"3D Printing","Advanced Manufacturing",TMT} 2020-12-31 2020 5 ... 97.11 97.11 100.0 1 {54544-15} {"Stratasys (NAS: SSYS)"} {"54544-15-Yoav Zeif"} {"Yoav Zeif"} {} {}
3 169319-17T 266103-37 Forte (Software Development Applications) San Francisco, CA SanFrancisco CA {Cryptocurrency/Blockchain,Gaming} 2020-12-31 2020 3 ... <NA> 2.56 <NA> 1 {439103-62} {"TriplePoint Private Venture Credit BDC"} {439103-62-None} {NULL} {} {}
4 172449-01T 466030-45 Kado New York, NY NewYork NY {Industrials,Mobile,SaaS,TMT} 2020-12-31 2020 1 ... <NA> 0.6 <NA> <NA> {} {} {} {} {} {}

5 rows × 24 columns

In [9]:
px.bar(
    deals.merge(companies[['co_id','co_primary_industry_sector']], on='co_id').groupby(['deal_vintage','co_primary_industry_sector'])['co_id'].nunique() \
    .rename('n_deals').reset_index(),
    x='deal_vintage',
    y='n_deals',
    color='co_primary_industry_sector',
    barmode='group'
).update_layout(
    title_text='N Deals per Year',
    xaxis={'title_text':'deal vintage'},
    yaxis={'title_text':'# deals'},
    legend_title='primary industry sector'
)
In [10]:
temp = deals.merge(companies[['co_id','co_primary_industry_sector']], on='co_id').groupby(['deal_vc_round', 'co_primary_industry_sector'])['co_id'].nunique() \
            .rename('n_deals').reset_index()

temp.loc[:,'deal_vc_round_num'] = temp['deal_vc_round'].str.split(r"(th|st|rd|nd) ").str[0].apply(lambda x: 0 if x == 'Angel' else x).astype('int64')

px.histogram(
        temp,
        x='deal_vc_round_num',
        y='n_deals',
        color='co_primary_industry_sector',
        barmode='group',
        histnorm='percent'
).update_layout(
    title_text='% Deals per VC Round<br><sup>0 is an angel round',
    xaxis={'title_text':'VC Round'},
    yaxis={'title_text':'% deals'},
    legend_title='primary industry sector'
)
In [11]:
temp = deals.merge(companies[['co_id','co_primary_industry_sector']], on='co_id')[['deal_vc_round', 'deal_size', 'co_primary_industry_sector']].dropna()
temp.loc[:,'deal_vc_round_num'] = temp['deal_vc_round'].str.split(r"(th|st|rd|nd) ").str[0].apply(lambda x: 0 if x == 'Angel' else x).astype('int64')


px.histogram(
    temp.groupby(['deal_vc_round_num','co_primary_industry_sector'])['deal_size'].median().rename('median_deal_size').reset_index(),
    x='deal_vc_round_num',
    y='median_deal_size',
    color='co_primary_industry_sector',
    barmode='group',
    log_y=True
).update_layout(
    title_text='Median Deal Size per VC Round',
    xaxis={'title_text':'VC Round'},
    yaxis={'title_text':'Median Deal Size'},
    height=500
)

Employee History - Sample¶

In [12]:
"""Here, we are just separating and expanding on employee history column, co_employee_hist, which is formatted as string"""
emp_hist = emp_hist.join(emp_hist['co_employee_hist'].apply(lambda x: x[1:-1].replace('"', '').split(',')).explode().rename('expl_emp_hist')) # remove brackets and extra quotes, split into list then explode
emp_hist['emp_hist_yr'] = emp_hist['expl_emp_hist'].str.split(': ').apply(lambda x: x[0]).astype('int64')
emp_hist['emp_hist_cnt'] = emp_hist['expl_emp_hist'].str.split(': ').apply(lambda x: x[1]).astype('int64')
emp_hist.drop(columns=['expl_emp_hist', 'co_employee_hist'], inplace=True)
In [13]:
emp_hist.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 256152 entries, 0 to 83976
Data columns (total 18 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   co_id                       256152 non-null  string
 1   co_name                     256152 non-null  string
 2   co_ticker                   40836 non-null   string
 3   co_city                     255655 non-null  string
 4   co_state                    256118 non-null  string
 5   co_year_founded             248980 non-null  Int64 
 6   co_business_status          255590 non-null  string
 7   co_financing_status         256152 non-null  string
 8   co_ownership_status         256152 non-null  string
 9   co_primary_industry         256000 non-null  string
 10  co_primary_industry_group   256000 non-null  string
 11  co_primary_industry_sector  256000 non-null  string
 12  co_industries               256152 non-null  string
 13  co_verticals                256152 non-null  string
 14  co_keywords                 256152 non-null  string
 15  co_description              256081 non-null  string
 16  emp_hist_yr                 256152 non-null  int64 
 17  emp_hist_cnt                256152 non-null  int64 
dtypes: Int64(1), int64(2), string(15)
memory usage: 37.4 MB
In [14]:
emp_hist.head()
Out[14]:
co_id co_name co_ticker co_city co_state co_year_founded co_business_status co_financing_status co_ownership_status co_primary_industry co_primary_industry_group co_primary_industry_sector co_industries co_verticals co_keywords co_description emp_hist_yr emp_hist_cnt
0 100002-34 Acton Marketing <NA> Lincoln Nebraska 1982 Generating Revenue Corporate Backed or Acquired Acquired/Merged Media and Information Services (B2B) Commercial Services Business Products and Services (B2B) {"Media and Information Services (B2B)","Other... {} {"crosses selling","directing marketing","fina... Provider of direct marketing services. The com... 2014 16
1 100003-15 Premama <NA> Providence Rhode Island 2011 Generating Revenue Venture Capital-Backed Privately Held (backing) Other Consumer Non-Durables Consumer Non-Durables Consumer Products and Services (B2C) {"Other Consumer Non-Durables","Other Pharmace... {E-Commerce,FemTech,HealthTech,"LOHAS & Wellne... {"fertility care","maternity food","postnatal ... Online retailer of medical supplements intende... 2015 5
1 100003-15 Premama <NA> Providence Rhode Island 2011 Generating Revenue Venture Capital-Backed Privately Held (backing) Other Consumer Non-Durables Consumer Non-Durables Consumer Products and Services (B2C) {"Other Consumer Non-Durables","Other Pharmace... {E-Commerce,FemTech,HealthTech,"LOHAS & Wellne... {"fertility care","maternity food","postnatal ... Online retailer of medical supplements intende... 2016 3
1 100003-15 Premama <NA> Providence Rhode Island 2011 Generating Revenue Venture Capital-Backed Privately Held (backing) Other Consumer Non-Durables Consumer Non-Durables Consumer Products and Services (B2C) {"Other Consumer Non-Durables","Other Pharmace... {E-Commerce,FemTech,HealthTech,"LOHAS & Wellne... {"fertility care","maternity food","postnatal ... Online retailer of medical supplements intende... 2017 6
1 100003-15 Premama <NA> Providence Rhode Island 2011 Generating Revenue Venture Capital-Backed Privately Held (backing) Other Consumer Non-Durables Consumer Non-Durables Consumer Products and Services (B2C) {"Other Consumer Non-Durables","Other Pharmace... {E-Commerce,FemTech,HealthTech,"LOHAS & Wellne... {"fertility care","maternity food","postnatal ... Online retailer of medical supplements intende... 2020 10
In [15]:
emp_hist.describe()
Out[15]:
co_year_founded emp_hist_yr emp_hist_cnt
count 248980.000000 256152.000000 2.561520e+05
mean 1993.118263 2014.024314 2.236470e+03
std 31.089360 7.359399 2.283181e+04
min 1800.000000 1812.000000 1.000000e+00
25% 1986.000000 2010.000000 1.300000e+01
50% 2004.000000 2016.000000 5.100000e+01
75% 2013.000000 2020.000000 2.750000e+02
max 2021.000000 2022.000000 2.300000e+06
In [16]:
px.histogram(
    emp_hist.groupby(['emp_hist_yr', 'co_primary_industry_sector'])['co_id'].nunique().rename('n_companies').reset_index(),
    x='emp_hist_yr',
    y='n_companies',
    color='co_primary_industry_sector'
).update_layout(
    title_text='N Companies with Employee Records For Year X',
    xaxis={'title_text':'Year X'},
    yaxis={'title_text':'N Companies'},
    legend_title='primary industry sector'
)
In [17]:
px.histogram(
    emp_hist.groupby(['emp_hist_yr','co_primary_industry_sector'])['emp_hist_cnt'].mean().rename('mean_n_emp').reset_index(),
    x='emp_hist_yr',
    y='mean_n_emp',
    color='co_primary_industry_sector',
    barmode='group'
).update_layout(
    title_text='Average N employees per Year',
    xaxis={'title_text':'Year'},
    yaxis={'title_text':'N Employees'}
)
In [18]:
px.histogram(
    emp_hist.groupby(['emp_hist_yr','co_primary_industry_sector'])['emp_hist_cnt'].mean().rename('mean_n_emp').reset_index() \
        .query('co_primary_industry_sector != "Consumer Products and Services (B2C)"'),
    x='emp_hist_yr',
    y='mean_n_emp',
    color='co_primary_industry_sector',
    barmode='group'
).update_layout(
    title_text='Average N employees per Year<br><sup>Not incl. Consumer Products and Services (B2C)',
    xaxis={'title_text':'Year'},
    yaxis={'title_text':'N Employees'}
)
In [ ]: